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ABSTRACT 

A popular approach for comparing gene expression 
levels between (replicated) conditions of RNA se- 
quencing data relies on counting reads that map 
to features of interest. Within such count-based 
methods, many flexible and advanced statistical ap- 
proaches now exist and offer the ability to adjust for 
covariates (e.g. batch effects). Often, these meth- 
ods include some sort of 'sharing of information' 
across features to improve inferences in small sam- 
ples. It is important to achieve an appropriate trade- 
off between statistical power and protection against 
outliers. Here, we study the robustness of existing 
approaches for count-based differential expression 
analysis and propose a new strategy based on ob- 
servation weights that can be used within existing 
frameworks. The results suggest that outliers can 
have a global effect on differential analyses. We 
demonstrate the effectiveness of our new approach 
with real data and simulated data that reflects prop- 
erties of real datasets (e.g. dispersion-mean trend) 
and develop an extensible framework for comprehen- 
sive testing of current and future methods. In addi- 
tion, we explore the origin of such outliers, in some 
cases highlighting additional biological or techni- 
cal factors within the experiment. Further details 
can be downloaded from the project website: http: 
//imlspenticton.uzh.ch/robinson_lab/edgeR_robust/. 

INTRODUCTION 

RNA sequencing (RNA-seq) is widely used for numerous 
biological applications, including the detection of alter- 
native splice forms, ribonucleicacid (RNA) editing, allele- 
specific expression profiling, novel transcript discovery but 
most commonly, for detecting changes in expression be- 
tween experimental conditions or treatments. Compared to 
microarray technology, RNA-seq offers an open system, 
higher resolution, lower relative cost and less bias (1). A typ- 



ical RNA-seq experiment includes: (i) capture of an RNA 
subpopulation (e.g. polyA-enriched, depleted of ribosomal 
ribonucleicacid) from cells of interest; (ii) reverse transcrip- 
tion into complementary DNA (cDNA); (iii) preparation 
and sequencing of millions of short cDNA fragments (~200 
bp); (iv) mapping to a reference genome or (assembled) 
transcriptome; (v) counting according to a catalog of fea- 
tures. This last counting step can be conducted by exclud- 
ing ambiguous reads between genes (2), or with advanced 
tools that portion ambiguous reads to transcripts (3) or can 
be done in combination with assembly tools (4). The focus 
here is on methods for count-based differential expression 
(DE) analyses and the robustness thereof; thus, the starting 
point here is a count table of features-by-samples, such as 
those available from the ReCount project (5). 

Considerable recent effort has been paid by the statis- 
tical community to the discovery of DE features, given a 
count table; recent comparisons have shown that no method 
dominates the spectrum of possible situations (6,7). RNA- 
seq remains expensive and in many cases researchers are 
studying precious samples or rare cell types, so the num- 
ber of biological replicates is often limiting. It is clear that 
the most successful methods implement some form of 'in- 
formation sharing' across the whole dataset to improve DE 
inference (2), and this becomes an intricate exercise to trade- 
off power, false discovery control and protection against 
outliers. To highlight this distinction, we describe two pop- 
ular software implementations for the negative binomial 
(NB) model, which arguably is the de facto standard for 
accounting for biological variability in such genome-scale 
count datasets. The latest version of edgeR moderates dis- 
persion estimates toward a trended-by-mean estimate (8), 
whereas DESeq takes the maximum of a fitted dispersion- 
mean trend or the individual feature-wise dispersion esti- 
mate (9). The effect imposed on features with 'outliers' is 
illustrated in Figure 1 . Ten randomly selected samples from 
individuals from the HapMap project (denoted as Pickrell 
(10)) are divided into two groups of 5, forming an artificial 
'null' scenario. While very little true differential expression 
is expected, a low rate of false detections occur; in partic- 
ular, edgeR detects a small number of genes with low esti- 
mated false discovery rate that exhibit one or two observa- 
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Figure 1. From Pickrell (10) data, 10 randomly selected samples from in- 
dividuals are divided into two groups of 5, forming an artificial 'null' sce- 
nario, (a), (b) and (c) show barplots of log-counts-per-million (CPMs) of 
three genes from the top 10 DE genes with one or two extremely large ob- 
servations. Dashed lines represent group-wise average log-CPMs. (d) and 
(e) plot genewise biological coefficient of variation (BCV) against gene 
abundance (in log2 counts per million) for edgeR and DESeq. In panel (d), 
gray dots show unmoderated biological BCV estimates (vWvW) (prior 
degrees of freedom = 0). Steel blue dots show moderated biological BCV 
with prior degree 10 (default setting for edgeR). Three outlier genes on (a), 
(b) and (c) are labeled by large blue dots. For (e), DESeq uses the maximum 
(steel blue dots) of a fitted dispersion-mean trend (red line) or the individ- 
ual feature-wise (tagwise) dispersion estimate. Three outlier genes are also 
pointed out by large blue dots. 



tions that are generally much higher in expression (Figure 
la-c). We believe that there are two causes for this: (i) the 
sensitivity of relative expression estimates to these 'outlying' 
observations; (ii) moderation of the dispersion estimates to- 
ward the trend. In contrast, DESeq remains largely unaf- 
fected by these outliers, since the dispersion estimation pol- 
icy is to keep the maximum; in what follows, we will explore 
the effect of this maximum policy on power. All computed 
statistics for this dataset are stored in Supplementary Table 
SI. 

The downstream effect of these dispersion estimation 
strategies suggest: (i) DESeq is generally conservative but 
robust; (ii) edgeR can be sensitive to outliers when there 
is sufficient dispersion smoothing toward the trend (effec- 
tively underestimating the dispersion in the shrinking pro- 
cess), but should be more powerful in the absence of such 
extreme observations (2). Our goal in the current study is 
to achieve a suitable middle ground, perhaps forfeiting a 
small amount in statistical efficiency, similar to established 
robustness frameworks, to reduce the influence of extreme 
observations in differential expression calls. As hinted above 
and in general, robustness is not solely determined by the 
dispersion parameter, but also by controlling the influence 
of outliers to other parameters in the model (e.g. those rep- 
resenting changes in expression). We explore these aspects 
in both simulated and real data, provide a extensible frame- 
work for evaluating the tradeoffs and highlight some in- 
stances of biology or technical factors that may give out- 
liers. 

The literature is rich in alternatives for count-based DE 
analyses and in particular, dispersion estimation, yet it re- 
mains increasingly difficult to assess the performance across 
the range of possibilities. For example, recent evidence sug- 
gests that one can suitably transform count data and ana- 
lyze with methods developed for microarrays, with special 
treatment (11). The mainstream strategy is to directly fit 



count data to extensions of the Poisson model and in partic- 
ular, the NB model. Many implementations are available as 
R/Bioconductor packages (12), such as edgeR (13), DESeq 
(9), ShrinkBayes (14), baySeq (15) and variations of disper- 
sion estimation that can be used within existing implemen- 
tations (16); the main differences lie in the estimation of the 
dispersion or in the inference machinery (e.g. Bayesian ver- 
sus frequentist). Recent comparisons and summaries of the 
methods available can be found in (2), (6) and (7). 

Some early and existing count-based DE analysis tools 
only allowed two-group comparisons. That is, they could 
not handle more complex situations, such as paired sam- 
ples, time courses or batch effects. Recently, McCarthy et 
al. developed generalized linear model (GLM) capabilities 
in edgeR (8), allowing a much broader class of experimen- 
tal designs to be analyzed and other frameworks have fol- 
lowed suit. However, GLMs require iterative fitting and 
more complicated dispersion estimation machinery (8). As 
shown in Figure 1, this framework can suffer a lack of 
robustness, whereby even a single extreme value (outlier) 
could largely affect estimates of regression parameters (e.g. 
mean of experimental condition), as highlighted by recent 
comparative studies (6) (see also Figure 1). In addition, the 
moderation of the dispersion parameter toward a trended 
value is actually contributing to the lack of robustness, forc- 
ing the dispersion to be underestimated (Figure 1). DESeq2 
(successor of DESeq) takes an altogether different stance on 
robustness: using a Cook's distance metric, features that ex- 
hibit an extreme value are not considered for downstream 
statistical testing. 

The strategy proposed in this paper is that of 'observation 
weights', effectively down-weighting outliers to dampen 
their influence. There is already some precedent for do- 
ing this in GLM settings: Carroll and Pederson (17) in- 
troduced weighted maximum likelihood estimators for the 
logistic model; Cantoni (18) presented a robust quasi- 
likelihood approach for inference in binomial and Poisson 
models; Agostinelli and Alqallaf (19) derived weighted like- 
lihood equation for GLMs by directly inserting 'observa- 
tion weights' into iterative re-weighted least squares algo- 
rithm (IRLS). Of particular importance, after adding ob- 
servation weights, the asymptotic theory suggests that like- 
lihood ratio statistics of model parameters still converge to 
approximate chi- squared distributions under the null hy- 
pothesis (20). At present, no 'off-the-shelf robust approach 
is readily available for the negative binomial model in the 
context of genome-scale computations. In this paper, we 
build an outlier-resistant framework that maintains high 
power and achieves decent false discovery control and make 
it available in the edgeR software package; the same strat- 
egy could be employed in other frameworks. We benchmark 
its performance on real and simulated data and explore the 
origins of outlying observations. 

MATERIALS AND METHODS 

A standard setup of NB model in GLM framework 

To most easily explain the addition of observation weights, 
we follow closely the notation used in McCarthy et al. (8). 
Let the Y gi be the read count in sample i for feature g (g = 1 , 
. . . , G). Assume Y gi follows a NB distribution with mean jji gi 
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and dispersion 0 g , denoted by Y gi ~ NB(^-, 4> g ). Feature g's 
variance equals /jb gi + 4> g • /x g / 2 , while the dispersion 0 g rep- 
resents the square of the 'biological coefficient of variation' 
(8). In the GLM setting, the mean response, \i gU is linked 
to a linear predictor, here with the canonical logarithm link 
according to: 



log(Ai g/ ) = 1% + log TV}, 



(1) 



where Xis the design matrix containing the covariates (e.g. 
experimental conditions, batch effects, etc.), fi g is a vector 
of regression parameters (a subset of which are of interest 
for differential expression inference) and TV, is the (effective) 
library size for sample i. 

For estimation of the regression parameters, maximum 
likelihood estimation is used. The derivative of the log- 
likelihood, l(P g ), with respect to the coefficient fi g is X T zz g , 
where zz gi = (y gi — /%•)/( 1 + The estimated value of 

P g can be obtained by the IRLS in the form: 



P*™ = pf + (X 1 Q g X) 



1 X T z 



(2) 



where X T Q g X is the Fisher information matrix (also de- 
noted below as T g X g ) and Q g is the diagonal matrix of work- 
ing weights, which are jJL g /{\ + 4>gt^g) f° r the NB model. 

Moderated and trended dispersion estimates 

The adjusted profile likelihood (APL) introduced by Cox 
and Reid (21) has shown good performance for dispersion 
estimation in the context of genome-scale count data (8,22). 
The APL g is a likelihood in terms of <j> gi penalized for the 
estimation of the regression parameters, p g , as follows: 



AFL g ((P g ) = £((/> g ;y g , p g ) ■ 



\ log \T g | 



(3) 



where y^y^ is the vector of counts for gene g, P g P g is the es- 
timated coefficient vector, €(•) is the log-likelihood function, 
1 g X g is the Fisher information matrix and I - 1 is the determi- 
nant. The early strategy to accomplish moderation for the 
dispersion was by squeezing the tagwise dispersion toward a 
common dispersion that is estimated over all features (23). 
This weighted likelihood approach involves maximizing a 
linear weighting of the individual likelihood and the com- 
mon (averaged) likelihood, the two terms, respectively, in 

argmax JaPL^) + a ■ i APL ^) J ' ( 4 ) 

where a is a suitably chosen weight. 

A slight variation on this, which is now commonly 
applied after experience in many datasets showing a 
dispersion-mean relationship, is to shrink toward a disper- 
sion estimated from features with similar average expression 
level (8). This so-called trended dispersion is constructed us- 
ing local shared log-likelihood for feature g (more precisely, 
a smooth fit to common dispersions that are calculated in 
bins of averaged counts per million) and its neighboring fea- 
tures in terms of expression strength. Specifically, individual 
tagwise estimates for each feature can be estimated by max- 
imizing a linearly weighted function between individual dis- 
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Figure 2. The flow chart of the robust algorithm implemented in edgeR. 
ftp is the estimated GLM regression coefficient and 4>(j) is the moderated 
dispersion estimate by maximizing APLww (Equation (10)). r g t is the Pear- 
son residual corresponding to count y g i from Equation (l).ww g i is the ob- 
servation weight from Equation (8). LRT (glmLRT in edgeR) computes 
likelihood ratio tests using the weights. 



persion and local shared dispersion: 

4 = argmax {APL g (0 g ) + y ■ APLf(0 g )} , (5) 

where 4> g 4> g is moderated tagwise dispersion, y is the prior 
degree of freedom afforded to the shared likelihood and 



APLf(^)=i-^APU(^), 



(6) 



where the set C represents features that are close to feature 
g in average log counts per million. 



A robust negative binomial GLM 

Our approach to induce robustness is to attach a weight to 
each observation; observations that deviate strongly from 
the model fit are given lower weight. In particular, Pear- 
son residuals from the current fit are sent through a weight 
function, which gets passed to the next iteration of esti- 
mation. The dispersion estimation machinery (i.e. trended 
APL) also receives the same observation weight, so that the 
influence of outliers is dampened on both the regression and 
dispersion estimates.The robust iterative estimation proce- 
dure using weights is described in Figure 2. The Pearson 
residual of an observed count y gi from the NB GLM fit can 
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be calculated as 



ygi l^gi 



(7) 



w 



where figifigi is the fitted value (from fifi) and 4>g4> g is the 
moderated dispersion estimate. The Pearson residual is con- 
verted to weights using, e.g. the Huber function: 

for abs(r^) > k 
for abs(r^) < k 

where k represents a tuning constant for Huber estimator 
and is usually set to 1.345 in normally distributed settings 
to achieve 95% efficiency (24). This weight, ww gi , gets used 
in the next iteration of GLM fitting; the IRLS equation be- 
comes: 



ri = w(r gi ) = 



W-new 



W-old 



+ (X T [W g Q g ]X)- l X T [W g ]z g 



(9) 



where W g is the diagonal matrix of observation weights 
for feature g. The Fisher information matrix with observa- 
tion weight becomes Xf = X T [W g Q g ]XX^ = X T [ W g Q g ]X. 
In this approach, the APL for dispersion <p g with observa- 
tion weights can be written as 



APLf(^) 



l w (4> g ;y g J g )- l -log\l^\, (10) 



where i w {-) = J2i ww giK') is the weighted log-likelihood 
function and X^X^ is the Fisher information matrix with 
observation weights. Then, using these dispersion estimates, 
the regression parameters are estimated, again using the ob- 
servation weights. 

For users of edgeR, only a small change in the standard 
pipeline is required. 

A simulation framework with parameters based on the joint 
distribution of mean and dispersion estimates from RNA-seq 
data 

We built a simulation framework that aims to accurately re- 
flect the reality of RNA sequencing data. In order to evalu- 
ate the performance of our robust method and other meth- 
ods across a variety of reasonable conditions, we created 
several options: 

(1) nTags: total number of features, 

(2) group: factor containing the experimental conditions, 

(3) pDiff: proportion of DE features, 

(4) foldDiff: relative expression level of truly DE features, 

(5) pUp: proportion of DE features that increase in expres- 
sion, 

(6) dataset: dataset to take model parameters from, 

(7) pOutlier: proportion of outliers to introduce, 

(8) outlierMech: outlier generation mechanism to use. 

We generate true NB model parameters, fi and 0, using 
the joint distribution of estimates, ft ft and 00, estimated 
using edgeR from real datasets, such as the published count 
tables at ReCount (5): Pickrell (10), Cheung et al. (25,26). In 
particular, the joint distribution preserves the dispersion- 
mean trend, which can vary from dataset to dataset. After 



the removal of extremely high dispersions and low means 
(analogous to typical recommended filters; see Supplemen- 
tary Figure SI), the derived-from-real-data parameters are 
used to simulate the counts, from a NB distribution and op- 
tionally with true DE. 

To test robustness, we add outliers to the simulated 
counts. Outliers are large values and can be produced by two 
different mechanisms (outlierMech): first, counts are multi- 
plied by a random factor between 1.5 and 10, as employed 
by Soneson and Delorenzi (6), and includes both the 'sim- 
ple' (S) and 'random' (R) method. In S, a gene is chosen at 
some probability to have a single outlier randomly added. In 
R, each observation can become an outlier with some prob- 
ability. In the second mechanism, called 'model' (M), each 
observation can become an outlier with some probability 
and if so, is sampled from a second NB distribution with 
larger \i (original \i multiplied by random factor between 
1 .5 and 10); R and M methods induce the same overall out- 
lier rate. 

Recently, van de Wiel et al. modeled genome-scale count 
data as zero-inflated negative binomial model (ZINB), 
which seemed to explain some of the dispersion-mean re- 
lationship (4). We have not considered simulations from 
ZINB distributions, since they do not appear to explain 
all of the observed dispersion-mean relationship in the 
datasets that we tested (see Supplementary Figure S2). 



Methods compared 

We evaluated and compared several methods for DE anal- 
ysis, including edgeR, edgeR-robust, limma-voom, DESeq- 
pool, DESeq-glm, DESeq2, baySeq, SAMseq (27), EBSeq 
(28) and ShrinkBayes; the performance evaluation system 
that we developed allows arbitrary additions (assuming 
they are implemented in R). limma-voom is an extension 
to DE analysis of RNA-seq count data from limma (11); 
it transforms the count data with special treatment given 
to fitting the mean-variance relationship. DESeq is tested 
as two separate methods: DESeq-pool is the default setting 
method to estimate the empirical dispersion from all the 
conditions with replicates; DESeq-glm fits models accord- 
ing to a design matrix and estimates dispersion by maximiz- 
ing APL. edgeR, DESeq and DESeq2 differ in how the dis- 
persion is estimated: edgeR moderates dispersion toward a 
trended estimate (8), edgeR-robust expands this with ob- 
servation weights, DESeq takes the maximum of a fitted 
trend of dispersion or the individual feature-wise dispersion 
estimate (9). DESeq2 offers a zero-mean normal prior on 
the log-fold-changes for moderation and a proper modera- 
tion of dispersion estimates to a trended value, except when 
the feature exhibits variability much greater than other fea- 
tures at the same expression strength; for outlier protection, 
a Cook's distance is calculated and those features with an 
extreme value are not promoted to formal statistical test- 
ing i.e. P-values are set to NA; in our simulations, these P- 
values are set to 1 so as to not remove features. The default 
normalization method is also different among edgeR, DE- 
Seq and DESeq2. edgeR uses trimmed-mean-of-M-values 
(TMM) (29), while DESeq and DESeq2 use a relative-log- 
expression approach. SAMseq, a non-parametric method, 
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employs Wilcoxon rank-sum statistics to estimate false dis- 
covery rate (FDR) through sample permutations. 

baySeq, EBSeq and ShrinkBayes use Bayesian inference. 
baySeq employs the NB model and assumes that samples 
can be classified as different groups by their treatment con- 
ditions; samples within the same group should follow the 
same distribution and share parameters. Using an empiri- 
cal Bayes approach, baySeq estimates the posterior prob- 
ability of the null state. ShrinkBayes introduces the ZINB 
and performs inference using integrated nested Laplace ap- 
proximations (INLA) (30,31) and provides Bayesian FDR 
and local false discovery rate (lfdr) (32) estimates. Since 
the computational cost of ShrinkBayes is high, some com- 
parisons are skipped. EBSeq is similar to baySeq, provid- 
ing posterior probability of DE, as well as EE (equally ex- 
pressed), based on a parametric mixture model. Compared 
with other methods tested here, EBSeq can also detect DE 
isoforms in EE features, yet this is not our primary question 
here. 

Notably, new methods, or variations of existing ones can 
be easily added to our comparison framework, simply by 
providing a wrapper to an R function that contains the cor- 
rect inputs (count table, grouping variable) and outputs (P- 
values). See Supplementary web site for details. 

Comparison metrics 

To test the performance of each DE method in the pres- 
ence of outliers, we employ several standard metrics and 
plots: false discovery (FD) plots, receiver operating charac- 
teristic (ROC) curves, partial ROC curves and power curves. 
Power (TP) curves and (partial) ROC curves (i.e. up to a cer- 
tain false positive rate) evaluate the ability to distinguish, 
through statistical evidence, DE and non-DE. FD proce- 
dures gauge the control of the expected proportion of in- 
correctly rejected null hypotheses (33). Another useful plot 
is the relationship between TP rate and achieved false dis- 
covery rate across multiple thresholds. 

An open graphical tool and R code for re-analysis: evaluating 
DE analysis methods 

One disadvantage of current method comparisons (e.g. 
(6,7)) and those that accompany every new method pub- 
lished, is that they are a snapshot in time. If new meth- 
ods come along, the developer must demonstrate that their 
method is better, by some metric. This task is important 
but somewhat repetitive, because many of the same met- 
rics, plots and simulation models are (re-)implemented. We 
endeavored to create a system for performing standardized 
simulation-based testing. 

In addition, all analyses presented in this paper are freely 
available from our website. Moreover, our simulation and 
evaluation framework is made available as a web-sourceable 
script that consists of three modules: simulation, evaluation 
(running of the software packages) and metric computa- 
tion. Each module can be extended, using simple wrapper 
functions to existing R-based code, ensuring that our com- 
parison results are reproducible, extensible and relatively 
easy for the user to track exactly what code segments (and 
versions) were run. 
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Figure 3. (a) For the random 5 versus 5 split of the Pickrell data (10) 
shown in Figure 1, the trajectories of overall trended dispersion and for the 
three individual genes are shown over six iterations of the edgeR-robust 
re-weighted estimation scheme, (b) A bar plot of miR-133b expression 
from Witten et al. (25), including an observation with very high count, (c) 
weights for miR-133b after six iterations of the re-estimation from edgeR- 
robust. Dashed lines in panel (b) shown the group-wise CPM before and 
after weighting. 



In addition to R code, we make available a web-based 
shiny 'app' that can be used to look at simulation results 
across a wide number of conditions (34). Since there are 
often too many methods to be easily displayed together, 
our app gives users the ability to present results for a user- 
selected subset of methods; the results update automatically 
as the user selects different simulation settings. 

Functional category analysis for outliers 

To explore potential biological or technical factors that 
may manifest as outliers, we performed hypergeometric- 
based functional category analyses on the set of genes with 
weights less than some cutoff (here, set to 1) separately for 
each sample. Our goal with such an analysis is to identify 
possible biological or technical factors that affect a sub- 
set of genes for a particular experimental unit. In some 
cases, this may shed light on why the expression levels of 
some genes for a given sample are very different than that 
of their replicates. Furthermore, we can investigate whether 
the down-weighting is driven by technical factors. As a pos- 
itive control for this, we compared the observed weights to 
the sample- specific guanine-cytosine (GC) effects observed 
in the Pickrell dataset (10,35). 

RESULTS 

edgeR-robust dampens the effect of outliers 

To highlight how edgeR-robust dampens the influence of 
outliers, we return to the dataset shown in Figure 1 . Figure 
3a shows the trajectories for the three outliers in terms of 
their average log-CPM and dispersion estimates and how 
the dispersion-mean trend changes over six iterations of the 
edgeR-robust re-weighted estimation scheme. Although we 
have not studied convergence in depth, Supplementary Fig- 
ure S3 highlights the change in parameter estimate by itera- 
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tion; most features 'converge' after a small number of itera- 
tions and we use a fixed number of iterations as a stopping 
rule. As expected, the outliers appear 'extreme' according to 
the model, as also reflected by their residuals. Extreme resid- 
uals are then down weighted, iteratively, and both the dis- 
persion and average log-CPM estimates are updated (Figure 
3a). In particular, we notice large changes to the regression 
(e.g. log-fold-change) and dispersion parameter estimates, 
which impose better accordance, in terms of dispersion- 
mean relationship, with the other features in the dataset. 
Notably, Figure 3 a highlights a global drop in dispersion- 
mean trend after the iterative robust estimation, which sug- 
gests that outliers present in sufficient frequency may have 
a global effect on the statistical detection of DE within a 
dataset. Thus, we speculate that gains in statistical power 
(see sections below) may be achieved in part by this global 
drop in trended dispersion. 

In their manuscript, Li and Tibshirani (27) show some ex- 
treme examples of outliers affecting differential count anal- 
ysis of miRNA-seq data (in particular, see their Figure 2). 
Figure 3b shows one of those examples, mir-133b, and high- 
lights the estimated mean CPM by group, before and after 
down-weighting; the observation weights after six iterations 
are shown in Figure 3c. Notably, for this example, there still 
exists strong evidence for differential expression, even after 
careful reassessment of the outlying observations. 

Supplementary Table SI gives the full details of these 
analyses, before and after re-weighting. 

Simulation reflects real data 

To test the method on a wide range of simulated settings, we 
first generate count data from a model that reflects real data 
as well as possible. As described in the 'Materials and Meth- 
ods' section, we choose to take the joint distribution of esti- 
mated log-CPM and dispersion from a large dataset as the 
basis for the parameter settings and we use library sizes that 
mimic those from typical datasets. For example, the Pickrell 
dataset (10) consists of >50 replicates, which should repre- 
sent a reasonably accurate reflection of the range of abun- 
dances observed, as well as, in particular, the dispersion- 
mean relationship. We generate all data from the NB model 
and introduce outliers by various mechanisms (see 'Materi- 
als and Methods' section). Supplementary Figure S4 shows 
the dispersion-mean trend for the Pickrell dataset (top left) 
and an example simulated dataset based on the estimated 
parameters (top right), respectively, as well as the marginal 
distributions of both log-CPMs and dispersion. The frame- 
work for these simulations (see 'Materials and Methods' 
section) is designed to take an initial dataset that seeds the 
simulation parameters, so datasets spanning the range of 
biological variation could easily be tested. Notably, we ex- 
plored the Pickrell dataset for both the frequency of outliers 
(as detected by down- weighting; Supplementary Figure S5 
gives cumulative distributions of weights) and the magni- 
tude of the outliers relative to non-down-weighted observa- 
tions (Supplementary Figure S6) to justify the use of sim- 
ulation parameters. In particular, we note that the range of 
outlier deviations is within the range we use (e.g. multiplica- 
tion factor between 1.5 and 10; Supplementary Figure S6). 
Meanwhile, samples from the Pickrell dataset exhibit outlier 



rates of 2-10% 'per sample' (depending on where a weight 
threshold is set), suggesting our choice of 10% (of features 
with a single outlier) is in fact a conservative amount of out- 
liers that may be present. 

Standard metrics across various methods for various simula- 
tion settings 

Next, we present a representative simulation and perfor- 
mance results under a single 'reasonable' setting of the pa- 
rameters. We sampled NB model parameters /x and 0 from 
the joint distribution of estimates from the Pickrell data (10) 
(dataset); we filtered out the top 10% of the extreme disper- 
sion values (analogous to filtering; see Supplementary Fig- 
ure SI); 10 000 features were generated (nTags), with a 5 
versus 5 two-group comparison (group); 10% of them are 
defined as DE genes (pDiff=.l), symmetrically (pUp=.5) 
with fold difference 3 (foldDiff=3); outliers are introduced 
to 10% of the features (pOutlier=.l) using the 'simple' out- 
lier generation mechanism (outlierMech="S"); outliers are 
randomly distributed among all features; further details are 
described in the 'Materials and Methods' section. Original 
simulated counts and the counts with outliers introduced 
are separately recorded and all methods were run on both. 

Figure 4 shows the set of standard metrics: panels (a)- 
(c) and (d)-(f) show false discovery plots, ROC curves and 
power numbers, respectively, for the original and original- 
with-outliers datasets under the setting of simulation pa- 
rameters discussed above. Overall, the introduction of out- 
liers results in more false positives (Figure 4a versus d) 
and/or less true positives at the same false positive rate 
(Figure 4b versus e). In the absence of outliers, all meth- 
ods exhibit similar patterns of false discovery rates, with 
the Bayesian methods, ShrinkBayes and EBSeq having a 
slightly higher rate. Similarly, in terms of separating the 
truly DE from non-DE features using a P- value (or P- value- 
like score in the case of Bayesian methods), all methods 
are very close in performance. Furthermore, in the absence 
of outliers, edgeR, edgeR-robust and DESeq2 appear to 
have a slight edge in power at the method's 5% FDR, albeit 
the advantage is small (Figure 4c). When outliers are intro- 
duced, edgeR-robust shows some advantages over edgeR. 
In terms of statistical power, all methods drop in overall 
power with the introduction of outliers (Figure 4c versus f), 
while DESeq exhibits a spectacular drop. Notably, DESeq 
still maintains a good ranking of P-values (Figure 4f), but 
becomes very conservative due to the maximum- of-trend- 
and-individual dispersion policy; in this respect, presence of 
outliers affect the whole dataset (see Supplementary Figure 
S7). 

Since the direction of differential expression and the out- 
lier introduction are applied at random, we can further split 
the DE features according to the position of the outlier rel- 
ative to the direction of change in abundance (Figure 4g- 
i); 'DEupOutlier' represents the situation where the out- 
lier is added to the higher expressed group; 'DEdownOut- 
lier' represents those features where the outlier was added 
to the lower expressed condition; 'DEnoOutlier' represents 
DE features with no introduced outlier). Notably, edgeR 
shows the highest power in the 'DEupOutlier' setting, but 
this is artificial since the introduction of the outliers ac- 
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Figure 4. (a), (b) and (c) present FD, partial ROC (up to FP rate of 40%) 
and power plots (at each methods' 5% FDR) across several tested methods 
for datasets with no introduced outliers; (d), (e) and (f) show corresponding 
plots with datasets containing 10% outliers (i.e. 10% of genes have a single 
outlier) using 'S' method, (g), (h) and (i) split the results from panel (f) into 
three categories: features without outliers (g); outliers in the higher expres- 
sion group (h); outliers in the lower expression group (i). All power results 
are shown as overall (single dot on the left of the plot) and split across five 
equally-sized average-log-CPM groups. The X on panels (b) and (e) high- 
lights the achieved power (TP) according to each method's 5% FDR cutoff. 
Note that while panel (g) presents the situation with no outliers, there are 
outliers present in other features within the dataset and is therefore differ- 
ent from panel (c). 



tually helps detection. The 'DEdownOutlier' is the situ- 
ation where edgeR-robust comes to the forefront, as ex- 
pected, given that outliers strongly eliminate the differen- 
tial expression. In the absence of outliers, edgeR-robust still 
remains a strong competitor, closely followed by DESeq2, 
ShrinkBayes, limma-voom and edgeR. 

It is also interesting, as a byproduct, to consider how 
well the methods identify outliers. In particular, we com- 
pared edgeR-robust's observation weights (using both Pear- 
son and Deviance residuals) with DESeq2's Cook's distance 
metric (both at observation level and feature-wise maxi- 
mum) to separate the simulated outliers. Supplementary 
Figure S8 shows an ROC curve depicting how well the ob- 
servation weights (and other scores) separate outliers from 
non-outliers. Similarly, the default setting of DESeq2 leads 
to a similar tradeoff between false positives (here, falsely de- 
tected as an outlier) and false negatives (failing to identify 
an outlier) and Pearson residuals appear superior and are 
used for all further analyses with edgeR-robust. Notably, 
the edgeR-robust strategy smoothly identifies outliers and 
down-weights them according to the magnitude of discor- 
dance, instead of setting a hard threshold where statisti- 
cal tests are no longer conducted. One byproduct of DE- 
Seq2's hard threshold is a loss of power (e.g. Figure 4 pan- 
els h and i), since genes with true differential expression as 
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Figure 5. Power-to-achieved-FDR across hard (foldDiff e [2, 2.2]), 
medium (foldDiff e [3, 3.3]) and easy (foldDiff e [6, 6.6]) simulation set- 
tings, (a) No outliers; (b) 10% outliers. Y-axis shows TP rate and X-axis 
shows FD rate. Five simulations are shown for each method and each set- 
ting. Points are taken according to each method's FDR cutoffs at 0.02, 0.05 
and 0.1. 



well as outliers are excluded from statistical testing. We also 
tested DESeq2 after turning off the Cook's distance metric, 
which results in an expected sensitivity to outliers (Supple- 
mentary Figure S9). Although the focus has been on higher- 
in-magnitude outliers and indeed that is what we see more 
of (Supplementary Figure S6), lower outliers can be suf- 
ficiently detected and down- weighted (e.g. Supplementary 
Figure S10). 

A shiny app to display pre-computed simulation results 

The above discussion was in regard to a single dataset un- 
der a single set of simulation parameters. To provide a much 
wider scope of simulation settings, we created a web-based 
shiny app, that serves up pre-computed results over a range 
of simulation parameters, including different datasets, sam- 
ple sizes and so on. In addition, it allows users to plot re- 
sults for only the subset of desired methods and metrics 
from Figure 4. While new methods can only be added to 
the shiny app by us, existing simulations can be easily recre- 
ated in a local R environment or additional settings can be 
added, as described in the Supplementary Note. In general, 
the conclusions observed from the broader range of simula- 
tion settings (e.g. different magnitudes of DE, sample sizes) 
are in agreement with those mentioned above (see also Sup- 
plementary Figure SI 1). 

Across multiple simulations over a range of settings, edgeR- 
robust is somewhat liberal but maintains a strong power-to- 
achieved-FDR tradeoff 

To complement the simulation results for individual param- 
eter settings, we endeavored to create a compact summary 
of a wider range of simulations and explore another impor- 
tant aspect of the comparison: do methods accurately con- 
trol false discovery rate? Figure 5 shows a series of 15 simu- 
lations divided into three different blocks based on the de- 
gree of difficulty: 'hard' (foldDiff e [2, 2.2]), 'medium' (fold- 
Diff e [3, 3.3]) and 'easy' (foldDiff e [6, 6.6]), including five 
simulations within each group to illustrate sampling vari- 
ability. For each dataset, lines connect the true positive rates 
and achieved FDRs across three thresholds of the estimated 



e91 Nucleic Acids Research, 2014, Vol. 42, No. 11 



PAGE 8 OF 10 



FDR (0.02, 0.05, 0. 1). The rest of the simulation parameters 
are kept fixed: the NB model parameters originate from the 
Pickrell dataset (10), there are 10 000 features, we consider a 
two-group comparison (5 versus 5), 10% of features are DE 
and each dataset contains 10% 4 S' outliers; comparisons for 
3 versus 3 and 10 versus 10 are shown in Supplementary 
Figure SI 2. 

Overall, there is a broad range of power-to-achieved- 
FDR tradeoffs and no method dominates. EBSeq appears 
to lack power and can be both liberal and conservative. In 
general, DESeq is conservative and achieves lower power, as 
reported earlier (6). Altogether, the collection of methods, 
such as limma-voom, edgeR, edgeR-robust and DESeq2 
achieve similar power-to-achieved-FDR tradeoffs across 
the sample sizes, with perhaps a tendency to be more lib- 
eral in large sample sizes for edgeR and edgeR-robust. As 
expected and as highlighted above, edgeR-robust appears to 
have advantages in the presence of outliers, with only a mi- 
nor decrease in power when no outliers are present. Thus, 
edgeR-robust achieves a good tradeoff between power at the 
same achieved FDR, even if the target FDR is not quite 
met. Notably, DESeq2 offers a small advantage in power at 
low log-fold-changes while suffering a little bit in power for 
higher log-fold-changes. In all cases, limma-voom controls 
FDR well and maintains high power. 

Outliers may originate from technical or biological sources 

While the strategy based on observation weights appears 
useful for dampening the effect of outliers in differential 
expression analysis, it may also be of interest to investi- 
gate the origin of such outlying observations. In some cases, 
we know of technical artefacts that affect the profile of 
RNA-seq expression data, such as sample- specific GC con- 
tent biases, as highlighted and mitigated by the analyses of 
the HapMap consortium as well as in follow-up methodol- 
ogy development (e.g. conditional quantile normalization 
(35)). In this dataset, there are no experimental conditions 
to detect differential expression, so we fit an intercept-only 
model, using the iterative robust estimation scheme. Not 
surprisingly, we first observe that the two samples high- 
lighted by Hansen et al. also exhibit a relatively higher 
number of down weighted observations (Figure 6a). As ex- 
pected, the degree of down-weighting is strongly related to 
the GC content of the cDNA sequences of the genes in- 
volved (Figure 6b). 

In an unrelated dataset from Blekhman (36) comparing 
expression in human male and female livers, we observe that 
the most significantly overrepresented functional categories 
were strongly associated with the set of down-weighted 
genes from a single sample (SRX014822and3, green cir- 
cles in Figure 6c). These include several categories involving 
the extracellular matrix, as well as collagen catabolism and 
plasma membrane. We show the third most overrepresented 
category, 'extracellular matrix' (Figure 6c) because the size 
of this category allows individual genes to be visualized (fur- 
ther details are given in Supplementary Table S2). Although 
we cannot confirm the exact cause of the overrepresented 
gene ontology categories, we note that accumulation of col- 
lagen and excessive production of extracellular matrix pro- 
teins are associated with the development of liver fibrosis 




Figure 6. Technical ((a) and (b)) and biological (c) sources of outlier genes. 
The number of down weighted observations (a) and distribution of out- 
lier weights as a function of the gene GC% in three samples from the 
HapMap RNA-Seq data (10) are plotted (b). Two of the samples shown 
(NA1 1918 and NA 12761) were shown by Hansen et al. to have strong, op- 
posing relationships between GC% and mapped reads per kilobase per mil- 
lion reads (RPKM). The third sample (NA12156) had the least number of 
genes down weighted after applying our robust down weighting procedure, 
(c) The log(CPM) and the inverse of the down weighting value for genes 
in the 'extracellular matrix' gene ontology category, where a value of one 
indicates no down weighting and larger inverse weights indicate stronger 
down weighting. 

(e.g. 37,38), and we suggest that analyses such as these may 
assist biologists in identifying the source of outliers in gene 
expression. 

DISCUSSION 

Various method developers have shown that statistical 
methods for discerning differential expression from RNA- 
seq data represented as counts can be sensitive to outlying 
observations. In this report, we have studied in detail the 
effects of outliers on various approaches and developed a 
new method based on observation weights that can detect 
and dampen the effect of outliers. In fact, it requires a deli- 
cate tradeoff to maintain high power while at the same time 
achieving a decent resistance to the presence of outliers. In 
particular, it is difficult to know exactly what an outlier is 
and where the line should be drawn to identify it as such. 
In this respect, we take a 'smooth' approach of dampening 
their effects, when there is evidence to support departure 
from the model. We have also explored the origin of such 
outliers and in some cases, we may be able to identify ei- 
ther a technical or biological effect to explain them. Our 
robust approach follows the strategy of classical robustness 
methods that are commonly applied to the linear regression 
problem. In our approach, we adopted the calculation of the 
residuals and observation weights to the specifics of the flex- 
ible dispersion estimation and standard GLM regression es- 
timation of the negative binomial model. 

As mentioned above, one reason that edgeR is sensi- 
tive to outlying observations is that the dispersion esti- 
mate used in the downstream inference is pulled toward the 
dispersion-mean trend, which may underestimate the vari- 
ability. Therefore, another way to dampen the effect of out- 
liers is to decrease the degree of moderation toward the 
dispersion-mean trend. Although we have not studied it 
here, there is again a delicate tradeoff between the degree of 
moderation to use and the average inference performance; 
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it still remains an open question as to how exactly to set this 
value for a given dataset. 

Though motivated and tested on real datasets, we em- 
ployed simulations to explore the broad range of possible 
settings and developed a comprehensive system for such 
evaluations. Our strategy to mimick real datasets is to take 
the joint distribution of mean and dispersion estimates from 
a large dataset as the basis for parameters to sample from. 
From such a dataset, outliers and differential expression at 
a specified level can be readily introduced. In fact, because 
these are estimates and not true values, we expect the sam- 
pled dispersion to potentially exhibit more variation than 
observed in a real dataset. In terms of evaluating the dif- 
ferent methods across the spectrum of simulation settings, 
it is important to consider it from all points of view: false 
discoveries amongst the list of top called features, the abil- 
ity to separate the truly differential from non-differential 
(i.e. ranking by statistical evidence), the statistical power at 
thresholds that are typically used in practice and the degree 
to which methods achieve their purported false discovery 
rates. 

Overall, the observation weight robust method performs 
well and achieves the goal of suffering only minimal loss of 
power, while maintaining resistance to introduced outliers. 
We have investigated the outlier policy in other packages 
and highlight that smoothly down-weighting outlying ob- 
servations appear preferable. In DESeq, a hard line against 
outliers is taken by using the maximum of a dispersion- 
mean trend and the individual estimate; with the addition 
of outliers, this has a global effect of increasing the vari- 
ance to all features and gives a resulting loss of power. In 
DESeq2, a Cook's distance metric is used to remove fea- 
tures with outliers entirely from further consideration; in 
this case, features that have outliers and differential expres- 
sion are excluded, with a potential loss of power. It is some- 
what of a philosophical decision as to whether to completely 
filter out features or to down-weight them; the observation 
weight strategy allows both. 

Another important consideration is the required sample 
size to be able to achieve estimators that are resistant to out- 
liers. Indeed, the lowest levels of replication (e.g. two sam- 
ples per condition) will not be sufficient. The minimum level 
of replication to dampen effects of outliers is three samples 
per condition, but this is the limit of any robust procedure. 

With the simulation system that we have created, we can 
now make a call to the community of both developers and 
users to check the effect of various settings. All that is re- 
quired to test a new method and compare it against exist- 
ing methods is to write a wrapper function with the correct 
inputs and outputs. In addition, if the exact simulation set- 
tings that we use in this report are not adequate, we can 
easily extend this framework into an open testing system 
that allows additional variations on the sampling model, 
perhaps including additional distributions or constructed 
truths, such as plasmodes (39). 

The current edgeR framework does not always achieve 
its false discovery rate target. However, even if it is forced 
to be more conservative, it still achieves power as good or 
better than existing approaches across the simulation set- 
tings that we have tested, even with the addition of observa- 
tion weights. The exact source of the liberality is beyond the 



scope of the current investigation, but there may be room 
for improvement, such as borrowing ideas from small sam- 
ple asymptotic approximations (40). 

CONCLUSION 

We developed an approach to dampen the effect of out- 
liers on count-based differential expression analyses. Over- 
all, the method appears to achieve the desired 'efficiency': 
a resistance to outliers while maintaining high power. We 
provided an implementation for the edgeR Bioconductor 
package, but the re-weighting idea could easily be adopted 
to other packages. In addition, we developed an extensi- 
ble simulation system (at the count table level) that readily 
performs the simulations based on an existing dataset and 
provides the infrastructure for producing the standard bat- 
tery of evaluations. In particular, this allows new methods 
or variations (e.g. alternative settings) of existing packages 
to be quickly explored. Instead of preparing a large num- 
ber of Supplementary Figures, we provide an interactive 
web-based shiny 4 app' to display simulation results across 
a broad range of simulation settings. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online, includ- 
ing[l,2]. 
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